Customer Segmentation and Sales Analysis¶

Goal:
The objective of this project is to better understand the types of customers who purchase Chips and their purchasing behaviour within the region(understanding purchasing trends and behaviour).Additionally we also aim to identify customer segmentation and provide a strategic recommendation to boost sales.

About Datasets:¶

The datasets are provided by Quantium as part of the Forage data analytics simulation. It includes transaction and customer data from a retail store, capturing information about product sales, dates, and customer interactions. The datasets are structured to help analyze sales trends and customer behavior.
Key features include:

  • Product ID: Unique identifier for each product
  • Transaction Date: The date on which the transaction occurred.
  • Sales Amount: The total sales value for each transaction.
  • Customer ID: Unique identifier for each customer.

Installing Libraries¶

In [1]:
%%capture
pip install openpyxl

Importing Libraries¶

In [57]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import math

Importing Datasets¶

In [3]:
fd=pd.read_excel('C:/Users/obalabi adepoju/Downloads/Quantium/transaction_data.xlsx')
df=pd.read_csv('C:/Users/obalabi adepoju/Downloads/Quantium/purchase_behaviour.csv')

We'll be starting with the purchase behaviour dataset.
let's check out the first 10 records of the dataset

In [4]:
df.head(10)
Out[4]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER
0 1000 YOUNG SINGLES/COUPLES Premium
1 1002 YOUNG SINGLES/COUPLES Mainstream
2 1003 YOUNG FAMILIES Budget
3 1004 OLDER SINGLES/COUPLES Mainstream
4 1005 MIDAGE SINGLES/COUPLES Mainstream
5 1007 YOUNG SINGLES/COUPLES Budget
6 1009 NEW FAMILIES Premium
7 1010 YOUNG SINGLES/COUPLES Mainstream
8 1011 OLDER SINGLES/COUPLES Mainstream
9 1012 OLDER FAMILIES Mainstream

LYLTY_CARD_NBR : This column represents the loyalty card number of each customer.
LIFESTAGE : This column categorizes customers based on their life stage, indicating the demographic group they belong to.
PREMIUM_CUSTOMER : This column indicates the premium status of the customer, describing their spending behavior.

Data Cleaning and Inspection¶

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   LYLTY_CARD_NBR    72637 non-null  int64 
 1   LIFESTAGE         72637 non-null  object
 2   PREMIUM_CUSTOMER  72637 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB
In [6]:
print(f"This dataset has {df.shape[0]} rows, {df.shape[1]} columns and it has no null values.")
This dataset has 72637 rows, 3 columns and it has no null values.

Let's check for duplicate values in the loyalty card column as this represents our primary_ID

In [7]:
print(f"Duplicate Values: {df['LYLTY_CARD_NBR'].duplicated().any()}")
Duplicate Values: False

Let's examine our LIFESTAGE column.

In [8]:
lifestage_counts=df['LIFESTAGE'].value_counts()
lifestage_counts
Out[8]:
LIFESTAGE
RETIREES                  14805
OLDER SINGLES/COUPLES     14609
YOUNG SINGLES/COUPLES     14441
OLDER FAMILIES             9780
YOUNG FAMILIES             9178
MIDAGE SINGLES/COUPLES     7275
NEW FAMILIES               2549
Name: count, dtype: int64

Let's visualize the distribution of each value.

In [9]:
%matplotlib inline
# Create the histogram
plt = px.histogram(df, x='LIFESTAGE',title='Customer Distribution by Lifestage Category',color='LIFESTAGE')

# Display the chart
plt.show()

Let's look at the premium_customer column

In [10]:
df['PREMIUM_CUSTOMER'].value_counts()
Out[10]:
PREMIUM_CUSTOMER
Mainstream    29245
Budget        24470
Premium       18922
Name: count, dtype: int64
In [11]:
plt = px.pie(df, names='PREMIUM_CUSTOMER', title='Customer Distribution',hole=0.4)

plt.show()

For a better understanding, we'll give a description of what each customer group means.

  1. Mainstream : These customers fall into the middle category of spending. They are not the highest spenders but are consistent and reliable customers. They typically spend more than budget customers but less than premium customers.
  2. Budget : These customers are the lowest spenders. They are likely more price sensitive and look for the best deals and discounts. They make up a significant portion of the customer base totalling 33% but are likely to contribute less to total sales.
  3. Premium : These customers are the highest spenders. They are less price sensitive and often look for high-quality or exclusive products. They are crucial for the business due to their higher spending habits.

Now let us dive into the transaction dataset.

In [12]:
fd.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   DATE            264836 non-null  int64  
 1   STORE_NBR       264836 non-null  int64  
 2   LYLTY_CARD_NBR  264836 non-null  int64  
 3   TXN_ID          264836 non-null  int64  
 4   PROD_NBR        264836 non-null  int64  
 5   PROD_NAME       264836 non-null  object 
 6   PROD_QTY        264836 non-null  int64  
 7   TOT_SALES       264836 non-null  float64
dtypes: float64(1), int64(6), object(1)
memory usage: 16.2+ MB
In [13]:
print(f"This dataset has {fd.shape[0]} rows, {fd.shape[1]} columns and it has no null values.")
This dataset has 264836 rows, 8 columns and it has no null values.
In [14]:
# A preview of the dataset showing the first 10 records
fd.head(10)
Out[14]:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
0 43390 1 1000 1 5 Natural Chip Compny SeaSalt175g 2 6.0
1 43599 1 1307 348 66 CCs Nacho Cheese 175g 3 6.3
2 43605 1 1343 383 61 Smiths Crinkle Cut Chips Chicken 170g 2 2.9
3 43329 2 2373 974 69 Smiths Chip Thinly S/Cream&Onion 175g 5 15.0
4 43330 2 2426 1038 108 Kettle Tortilla ChpsHny&Jlpno Chili 150g 3 13.8
5 43604 4 4074 2982 57 Old El Paso Salsa Dip Tomato Mild 300g 1 5.1
6 43601 4 4149 3333 16 Smiths Crinkle Chips Salt & Vinegar 330g 1 5.7
7 43601 4 4196 3539 24 Grain Waves Sweet Chilli 210g 1 3.6
8 43332 5 5026 4525 42 Doritos Corn Chip Mexican Jalapeno 150g 1 3.9
9 43330 7 7150 6900 52 Grain Waves Sour Cream&Chives 210G 2 7.2
  • DATE: An integer column representing the transaction date(encoded in a specific format).
  • STORE_NBR: An integer column representing the store number where the transaction took place.
  • LYLTY_CARD_NBR: An integer column representing the loyalty card number of the customer.
  • TXN_ID: An integer column representing the transaction ID.
  • PROD_NBR: An integer column representing the product number.
  • PROD_NAME: A string column representing the name of the product.
  • PROD_QTY: An integer column representing the quantity of the product purchased in the transaction.
  • TOT_SALES: A floating point column representing the total sales amount for the transaction.

Let's take a look at the product name column.

In [15]:
fd['PROD_NAME'].value_counts()
Out[15]:
PROD_NAME
Kettle Mozzarella   Basil & Pesto 175g      3304
Kettle Tortilla ChpsHny&Jlpno Chili 150g    3296
Cobs Popd Swt/Chlli &Sr/Cream Chips 110g    3269
Tyrrells Crisps     Ched & Chives 165g      3268
Cobs Popd Sea Salt  Chips 110g              3265
                                            ... 
RRD Pc Sea Salt     165g                    1431
Woolworths Medium   Salsa 300g              1430
NCC Sour Cream &    Garden Chives 175g      1419
French Fries Potato Chips 175g              1418
WW Crinkle Cut      Original 175g           1410
Name: count, Length: 114, dtype: int64
In [16]:
print(f"This dataset contains {fd['PROD_NAME'].nunique()} distinct chip names")
This dataset contains 114 distinct chip names

Now we check for duplicates in the loyalty card column

In [17]:
print(f"Duplicate Values: {fd['LYLTY_CARD_NBR'].duplicated().any()}")
Duplicate Values: True
In [18]:
#Removing Duplicates
fd=fd.drop_duplicates(subset=['LYLTY_CARD_NBR'],keep='first')
In [19]:
# Check if it has been removed
print(f"Duplicate Values: {fd['LYLTY_CARD_NBR'].duplicated().any()}")
Duplicate Values: False
In [20]:
fd.info()
<class 'pandas.core.frame.DataFrame'>
Index: 72637 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   DATE            72637 non-null  int64  
 1   STORE_NBR       72637 non-null  int64  
 2   LYLTY_CARD_NBR  72637 non-null  int64  
 3   TXN_ID          72637 non-null  int64  
 4   PROD_NBR        72637 non-null  int64  
 5   PROD_NAME       72637 non-null  object 
 6   PROD_QTY        72637 non-null  int64  
 7   TOT_SALES       72637 non-null  float64
dtypes: float64(1), int64(6), object(1)
memory usage: 5.0+ MB
In [21]:
print(f"This dataset has {fd.shape[0]} rows and {fd.shape[1]} columns and it has no null values.")
This dataset has 72637 rows and 8 columns and it has no null values.

Now we see number of records in transaction data is the same with purchase behaviour.

It's time to combine our dataset together.

In [22]:
data=pd.merge(df, fd, on='LYLTY_CARD_NBR', how='inner')
data.head(10)
Out[22]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER DATE STORE_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
0 1000 YOUNG SINGLES/COUPLES Premium 43390 1 1 5 Natural Chip Compny SeaSalt175g 2 6.0
1 1002 YOUNG SINGLES/COUPLES Mainstream 43359 1 2 58 Red Rock Deli Chikn&Garlic Aioli 150g 1 2.7
2 1003 YOUNG FAMILIES Budget 43531 1 3 52 Grain Waves Sour Cream&Chives 210G 1 3.6
3 1004 OLDER SINGLES/COUPLES Mainstream 43406 1 5 96 WW Original Stacked Chips 160g 1 1.9
4 1005 MIDAGE SINGLES/COUPLES Mainstream 43462 1 6 86 Cheetos Puffs 165g 1 2.8
5 1007 YOUNG SINGLES/COUPLES Budget 43438 1 7 49 Infuzions SourCream&Herbs Veg Strws 110g 1 3.8
6 1009 NEW FAMILIES Premium 43424 1 9 20 Doritos Cheese Supreme 330g 1 5.7
7 1010 YOUNG SINGLES/COUPLES Mainstream 43352 1 10 51 Doritos Mexicana 170g 2 8.8
8 1011 OLDER SINGLES/COUPLES Mainstream 43310 1 12 84 GrnWves Plus Btroot & Chilli Jam 180g 2 6.2
9 1012 OLDER FAMILIES Mainstream 43539 1 16 20 Doritos Cheese Supreme 330g 1 5.7

Exploratory Data Analysis¶

We'll dive into this next bout by specifying some of the questions we'll be answering in our analysis.

Which chip brands do most customers prefer?
Is there a relationship between packet sizes and total sales?
What packet sizes do customers prefer?
Which customer stage and segment has the highest contribution to total sales?
Which customer segment are likely to spend above the average total sales?

Which chip brands do most customers prefer?.¶

We'll be looking at the chip brands customers purchased the most and finding out if there is a reason for the preference.

In [23]:
grouped = data.groupby('PROD_NAME').agg(count=('PROD_NAME', 'count'))
grouped = grouped.sort_values('count', ascending=False).reset_index()  # Sorting by count
display(grouped)
PROD_NAME count
0 Cobs Popd Swt/Chlli &Sr/Cream Chips 110g 967
1 Twisties Cheese Burger 250g 949
2 Smiths Crnkle Chip Orgnl Big Bag 380g 947
3 Pringles Sweet&Spcy BBQ 134g 946
4 Thins Chips Light& Tangy 175g 944
... ... ...
109 Natural ChipCo Hony Soy Chckn175g 342
110 RRD Steak & Chimuchurri 150g 338
111 Red Rock Deli Chikn&Garlic Aioli 150g 333
112 Smiths Chip Thinly CutSalt/Vinegr175g 333
113 WW Crinkle Cut Original 175g 315

114 rows × 2 columns

In [24]:
grouped.describe()
Out[24]:
count
count 114.000000
mean 637.166667
std 268.963067
min 315.000000
25% 372.250000
50% 631.000000
75% 902.750000
max 967.000000

We'll be going into details of what a few of the rows imply

  • Count : There are 114 distinct chip brands.
  • Mean : This represents the average count of chip brands meaning the central point or the general level of chip purchases.
  • SD : This is the standard deviation which measures how close each individual value is to the mean.
  • Min : This is the smallest value in the column.
  • IQR : The interquartile range specifically the 50% value in the table means that half of our data values lie in between 902 and 372. It helps us to determine where the bulk of our data is concentrated.
  • Max : This is the largest value in the table.

Looking at the information above, it suggests that the variability makes it harder to determine a single preferred choice or list of choices that applies broadly to the population.
Let's test out that theory by looking at the top 50 chip brands with the highest purchases

In [25]:
test=grouped[['PROD_NAME','count']].head(50)
test
Out[25]:
PROD_NAME count
0 Cobs Popd Swt/Chlli &Sr/Cream Chips 110g 967
1 Twisties Cheese Burger 250g 949
2 Smiths Crnkle Chip Orgnl Big Bag 380g 947
3 Pringles Sweet&Spcy BBQ 134g 946
4 Thins Chips Light& Tangy 175g 944
5 Doritos Corn Chip Southern Chicken 150g 944
6 Kettle Mozzarella Basil & Pesto 175g 942
7 Tyrrells Crisps Lightly Salted 165g 941
8 Kettle 135g Swt Pot Sea Salt 939
9 Pringles Original Crisps 134g 933
10 Kettle Tortilla ChpsHny&Jlpno Chili 150g 932
11 Tostitos Splash Of Lime 175g 931
12 Doritos Corn Chips Nacho Cheese 170g 924
13 Pringles SourCream Onion 134g 924
14 Kettle Tortilla ChpsBtroot&Ricotta 150g 923
15 Tostitos Smoked Chipotle 175g 923
16 Dorito Corn Chp Supreme 380g 923
17 Cheezels Cheese 330g 918
18 Pringles Barbeque 134g 918
19 Kettle Tortilla ChpsFeta&Garlic 150g 917
20 Kettle Sea Salt And Vinegar 175g 916
21 Infuzions Thai SweetChili PotatoMix 110g 915
22 Kettle Chilli 175g 912
23 Infuzions BBQ Rib Prawn Crackers 110g 912
24 Thins Chips Seasonedchicken 175g 909
25 Pringles Chicken Salt Crips 134g 907
26 Tyrrells Crisps Ched & Chives 165g 905
27 Smiths Crinkle Original 330g 903
28 Cobs Popd Sea Salt Chips 110g 903
29 Smiths Crinkle Chips Salt & Vinegar 330g 902
30 Doritos Corn Chips Cheese Supreme 170g 901
31 Thins Potato Chips Hot & Spicy 175g 900
32 Twisties Chicken270g 899
33 Thins Chips Salt & Vinegar 175g 896
34 Grain Waves Sour Cream&Chives 210G 896
35 Doritos Corn Chip Mexican Jalapeno 150g 895
36 Kettle Honey Soy Chicken 175g 892
37 Kettle Original 175g 891
38 Kettle Sensations Camembert & Fig 150g 890
39 Tostitos Lightly Salted 175g 889
40 Doritos Mexicana 170g 888
41 Infzns Crn Crnchers Tangy Gcamole 110g 884
42 Doritos Cheese Supreme 330g 883
43 Kettle Sensations Siracha Lime 150g 880
44 Old El Paso Salsa Dip Tomato Med 300g 880
45 Grain Waves Sweet Chilli 210g 880
46 Infuzions SourCream&Herbs Veg Strws 110g 879
47 Old El Paso Salsa Dip Chnky Tom Ht300g 877
48 Doritos Corn Chips Original 170g 875
49 Kettle Sensations BBQ&Maple 150g 875

As expected, there is not enough difference between the count values of multiple chip brands to pinpoint the chip preference of customers. To prove this we'll be conducting a hypothesis test to determine if there is a statistical significance between the preference values of chip brands.

Claims :¶

  • Null Hypothesis : There is no significant difference between the sample mean and the population mean.
  • Alternative Hypothesis : There is a significant difference between the sample mean and the population mean.

Significance Level: 0.05 (5%)¶

Critical Value: ±1.96 (for two-tailed test at α=0.05)¶

In [26]:
test_sample=test.sample(25)# use a subset of our test dataframe as the sample
test_sample
Out[26]:
PROD_NAME count
48 Doritos Corn Chips Original 170g 875
3 Pringles Sweet&Spcy BBQ 134g 946
29 Smiths Crinkle Chips Salt & Vinegar 330g 902
16 Dorito Corn Chp Supreme 380g 923
15 Tostitos Smoked Chipotle 175g 923
20 Kettle Sea Salt And Vinegar 175g 916
49 Kettle Sensations BBQ&Maple 150g 875
33 Thins Chips Salt & Vinegar 175g 896
31 Thins Potato Chips Hot & Spicy 175g 900
19 Kettle Tortilla ChpsFeta&Garlic 150g 917
38 Kettle Sensations Camembert & Fig 150g 890
46 Infuzions SourCream&Herbs Veg Strws 110g 879
1 Twisties Cheese Burger 250g 949
44 Old El Paso Salsa Dip Tomato Med 300g 880
45 Grain Waves Sweet Chilli 210g 880
5 Doritos Corn Chip Southern Chicken 150g 944
24 Thins Chips Seasonedchicken 175g 909
13 Pringles SourCream Onion 134g 924
32 Twisties Chicken270g 899
17 Cheezels Cheese 330g 918
4 Thins Chips Light& Tangy 175g 944
37 Kettle Original 175g 891
34 Grain Waves Sour Cream&Chives 210G 896
41 Infzns Crn Crnchers Tangy Gcamole 110g 884
35 Doritos Corn Chip Mexican Jalapeno 150g 895
In [27]:
pop_mean=test['count'].mean()# Calculating the mean of our test population

pop_sd=test['count'].std()#Calculating the standard deviation of our test population

samp_mean=test_sample['count'].mean()# Sample mean 
In [28]:
z_value =1.96 
z_score= abs((samp_mean - pop_mean) / (pop_sd/math.sqrt(50)))
In [29]:
print(f"Z-Score: {z_score}")
Z-Score: 1.2615871363622293

Conclusion: Since test statistic is less than critical value i.e Z_Score < Z_ Value, we accept the null hypothesis and conclude there is no significant difference between the values in the population

Is there a relationship between packet sizes and total sales?.¶

In [30]:
# We'll be creating a new column called PACK_SIZE
data['PACK_SIZE(G)'] = data['PROD_NAME'].str.extract(r'(\d+)')
data['PACK_SIZE(G)']=pd.to_numeric(data['PACK_SIZE(G)'])
In [31]:
sales= data[['LYLTY_CARD_NBR','LIFESTAGE', 'PREMIUM_CUSTOMER','STORE_NBR' ,'TXN_ID', 'PROD_NBR','PROD_NAME','PACK_SIZE(G)','PROD_QTY','TOT_SALES']]
In [32]:
sales.head()
Out[32]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER STORE_NBR TXN_ID PROD_NBR PROD_NAME PACK_SIZE(G) PROD_QTY TOT_SALES
0 1000 YOUNG SINGLES/COUPLES Premium 1 1 5 Natural Chip Compny SeaSalt175g 175 2 6.0
1 1002 YOUNG SINGLES/COUPLES Mainstream 1 2 58 Red Rock Deli Chikn&Garlic Aioli 150g 150 1 2.7
2 1003 YOUNG FAMILIES Budget 1 3 52 Grain Waves Sour Cream&Chives 210G 210 1 3.6
3 1004 OLDER SINGLES/COUPLES Mainstream 1 5 96 WW Original Stacked Chips 160g 160 1 1.9
4 1005 MIDAGE SINGLES/COUPLES Mainstream 1 6 86 Cheetos Puffs 165g 165 1 2.8
In [33]:
sales['TOT_SALES'].describe()
Out[33]:
count    72637.000000
mean         6.976057
std          3.769011
min          1.500000
25%          4.600000
50%          7.400000
75%          8.800000
max        650.000000
Name: TOT_SALES, dtype: float64

In analyzing our total sales column, we discover a highly ambigous value which is completely different from the rest of our values as most are in the range of 4-8.
For a better understanding, we'll query our dataset to understand the outliers.

In [34]:
sales[sales['TOT_SALES'] >= 50]
Out[34]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER STORE_NBR TXN_ID PROD_NBR PROD_NAME PACK_SIZE(G) PROD_QTY TOT_SALES
59694 226000 OLDER FAMILIES Premium 226 226201 4 Dorito Corn Chp Supreme 380g 380 200 650.0

We find out that this is due to a ridiculously large product quantity and there is only one instance of this ambiguity and while it is entirely possible for the product quantity to be 200, we'll consider this entry to be a mistake so as not to affect our analyis.

In [35]:
# we'll replace the value with what we assume to be the correct entry
sales.iloc[59694,8]=2
sales.iloc[59694,9]=6.5
In [36]:
sales.loc[[59694]]
Out[36]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER STORE_NBR TXN_ID PROD_NBR PROD_NAME PACK_SIZE(G) PROD_QTY TOT_SALES
59694 226000 OLDER FAMILIES Premium 226 226201 4 Dorito Corn Chp Supreme 380g 380 2 6.5
In [37]:
# We'll then visualize the relationship between packet sizes and total sales.
sales.plot.scatter(x='TOT_SALES',y='PACK_SIZE(G)',alpha=0.25)
Out[37]:
<Axes: xlabel='TOT_SALES', ylabel='PACK_SIZE(G)'>

Looking at our scatter plot, it's fairly obvious there's no linear correlation between the packet sizes and total sales as the bulk of our data falls between (150 and 250)g as total sales increase.

What packet sizes do customers prefer?¶

We'll start off by grouping our dataset according to packet size and analyzing the sum of total sales for each size and number of occurences.

In [38]:
g=sales.groupby('PACK_SIZE(G)').agg(count=('PACK_SIZE(G)','count'),sales=('TOT_SALES','sum'))
g = g.sort_values('count',ascending=False).reset_index()# Sorting in descending order to view the top 5 chip preferences
g
Out[38]:
PACK_SIZE(G) count sales
0 175 18132 126605.80
1 150 11621 78375.40
2 134 7189 48462.60
3 110 6331 43646.80
4 170 5415 37756.30
5 300 4086 29120.50
6 165 4052 25246.40
7 330 3606 37392.00
8 380 1870 20505.15
9 210 1776 11588.40
10 270 1762 14779.80
11 200 1107 3520.70
12 250 949 7456.20
13 135 939 7110.60
14 160 786 2559.30
15 190 761 3344.70
16 90 757 2247.40
17 70 391 1636.80
18 220 385 1538.70
19 125 366 1310.40
20 180 356 1872.40

As we can see, the count of purchases for the first 5 sizes are in a league of their own compared to the remaining 16 pack sizes with the 175g and 150g pack sizes clearly being the most preferred option even among the top.

For emphasis sake, we'll be visualizing the top 5 pack sizes in comparison with the rest.

In [39]:
test = g.iloc[0:5,0:]
g = g.drop(range(0,5)).reset_index(drop=True)
In [40]:
test1= {"Pack" :['Top_5','others'],
        "count":[test['count'].sum(),g['count'].sum()],
        "sales":[test['sales'].sum(),g['sales'].sum()]
       }
test1=pd.DataFrame(test1)
In [43]:
ax=test1['count'].plot.bar(color='purple')

# Setting the x-axis labels
ax.set_xticklabels(test1['Pack'])

plt.title('Count of Top_5 vs Others')
plt.xlabel('Pack')
plt.ylabel('Count')
Out[43]:
Text(0, 0.5, 'Count')

There appears to be an unexpectedly huge difference between the other pack sizes and customers top 5 choices.
We'll now dive in deeper to see how this affects our total sales

In [44]:
fig, ax = plt.subplots()
ax.pie(test1['sales'], labels=test1['Pack'], autopct='%1.1f%%', startangle=90, colors=['lightblue', 'grey'], wedgeprops=dict(width=0.6))

plt.title('Distribution of Sales for Top_5 vs Others')

plt.show()

As expected, the overall sales of only the preferred pack sizes is relatively larger than the rest combined with sales of top_5 occupying more than half of total sales.

Next we'll be examining the top 5 choices themselves to gain an understanding of their distribution.

In [45]:
ax=test['count'].plot.bar()

# Setting the x-axis labels
ax.set_xticklabels(test['PACK_SIZE(G)'])

plt.title('Distribution of Top Choices')
plt.xlabel('Pack')
plt.ylabel('Count')
Out[45]:
Text(0, 0.5, 'Count')

Looking at the chart, we see there is no correlation between the sizes of the pack and why customers prefer one over the other, this may be due to other factors not considered in our dataset which can be rectified by further data collection or customers simply make their decisions on other qualitative factors such as hand-fit or feel.

We also see customer seem to prefer the 175g size the most.

In [58]:
#Let's see how this affect our sales
ax=test['sales'].plot.bar(color='pink')

ax.set_xticklabels(test['PACK_SIZE(G)'])

plt.title('Distribution of Top Sales')
plt.xlabel('Pack')
plt.ylabel('Sale')
Out[58]:
Text(0, 0.5, 'Sale')

Distribution of total sales seems to be very similar to choice as insights are exactly the same.

Which customer stage and segment has the highest contribution to total sales?¶

We'll begin our analyses by focusing on customer attributes first to find out which lifestages has the highest contribution to Total sales.

In [47]:
#grouping our data by LIFESTAGE and finding sum of each attribute.
sales.groupby('LIFESTAGE').agg(TSALE=('TOT_SALES','sum')).sort_values('TSALE',ascending=False)
Out[47]:
TSALE
LIFESTAGE
RETIREES 103143.50
OLDER SINGLES/COUPLES 103122.70
YOUNG SINGLES/COUPLES 98539.70
OLDER FAMILIES 68468.50
YOUNG FAMILIES 64320.00
MIDAGE SINGLES/COUPLES 50912.70
NEW FAMILIES 17569.25

We see that the older section of our population contribute the most to total sales with the exception of young singles/couples, we assume this is due to a lesser number of dependents in the lifestages of these customers in comparison to the remaining attributes in our data.

In [48]:
# visualizing our findings using a histogram
map={    
    'YOUNG SINGLES/COUPLES': 'blue',
    'YOUNG FAMILIES': 'grey',
    'OLDER SINGLES/COUPLES': 'blue',
    'MIDAGE SINGLES/COUPLES': 'grey',
    'NEW FAMILIES': 'grey',
    'OLDER FAMILIES': 'grey',
    'RETIREES' : 'blue'
    }
# Create the histogram
plt = px.histogram(data, x='LIFESTAGE',y='TOT_SALES',title='Sales Distribution by Lifestage Category',color='LIFESTAGE',color_discrete_map=map)

# Display the chart
plt.show()

We clearly see the 3 major customer lifestages to Total sales indicating their significant purchasing power and influence on sales.

Next we'll focus on customer segments to do a similar analysis.

In [49]:
#grouping our data by customer segments and finding sum of each group.
s=sales.groupby('PREMIUM_CUSTOMER').agg(Group=("PREMIUM_CUSTOMER",'first'),TSALE=('TOT_SALES','sum')).sort_values('TSALE',ascending=False)

Despite premium customers spending more on average compared to other groups, they are relatively rare. This scarcity is why mainstream customers contribute significantly to total sales because they form the bulk of our dataset. In contrast, budget customers, while spending less individually, make up for it with their larger numbers.

In [50]:
plt = px.pie(s, names='Group',values='TSALE',title='Sales Segmentation')

plt.show()

Which customer segment are likely to spend above the average total sales?¶

Firstly let's check what the average total sale of customers is.

In [51]:
sales['TOT_SALES'].describe()
Out[51]:
count    72637.000000
mean         6.967198
std          2.917684
min          1.500000
25%          4.600000
50%          7.400000
75%          8.800000
max         29.500000
Name: TOT_SALES, dtype: float64

We see from our statistical summary of total sales column that the average person spends about $6.96 with the minimum being 1.5 and maximum being 30 although most of our data are in the range of 4.6 to 8.8.
To prove that, let's look at the total sales of customers greater than 9.0.

In [52]:
d=sales.TOT_SALES[sales['TOT_SALES']>9.0].count()
d
Out[52]:
16965
In [53]:
print(f"Sales greater than 9.0 is only about {round(((d/72637) * 100),2)} % of our data.")
Sales greater than 9.0 is only about 23.36 % of our data.
In [54]:
#Let's look at the average sale of eac customer group.
s=sales.groupby('PREMIUM_CUSTOMER').agg(AVG_SALE=('TOT_SALES','mean')).sort_values('AVG_SALE',ascending=False)
s[['AVG_SALE']]
Out[54]:
AVG_SALE
PREMIUM_CUSTOMER
Mainstream 7.072459
Premium 6.902973
Budget 6.891060

We see that the differences in average sale between each customer group and average total sale is not very significant enough to be considered.

Insights & Recommendations¶

  1. Customer Preferences : Our analysis involved grouping the dataset by chip names to determine customer preferences. We provided a statistical summary of purchase counts for each chip brand, identifying the top choices. After examining multiple records, we found no significant differences in the purchase values among chip brands. This conclusion was further supported by a hypothesis test we conducted.
  • I recommend conducting targeted surveys to help uncover specific customer preferences and improve product offerings.
  1. Pack Size vs Total Sales : We sought to determine whether an increase in pack sizes would lead to higher total sales, specifically assessing if customers would be willing to spend more for larger quantities. However, the scatterplot revealed no significant linear correlation between pack size and total sales.
  • Our analysis suggests that consumer spending behavior does not align with expectations regarding quantity of products.
  1. Pack Size Preferences : Our analysis began by categorizing data based on pack sizes to evaluate purchase counts and total sales for each chip brand. We identified that 175g and 150g pack sizes emerged as the most popular among customers. Furthermore, we found that the top five sellers significantly outperformed all other pack sizes in terms of total sales. This suggests that customer preferences may be influenced by factors not captured in our dataset, such as product feel or convenience.
  • I recommend to optimize sales, we should focus on managing inventory effectively to ensure that popular pack sizes, like 175g and 150g, are consistently available. Additionally, implementing strategic discounts or promotions on these sizes could further boost customer purchases and enhance overall sales performance.
  1. Customer Segmentation

Our analysis started with identifying which customer lifestages significantly contribute to total sales. We found that older singles/couples and retirees are the primary drivers of revenue, followed closely by young singles/couples. This trend likely arises from fewer dependents in these groups compared to others. Furthermore, we shifted our focus to premium customer groups and found out that mainstream customers stand out due to their higher numbers, enhancing their overall contribution to total sales.

We went even further in our analysis by grouping each premium group according to lifestage. We'll look at an explanatory visualization to explain our next insight.

newplot%281%29.png

Notably, retirees and older singles/couples emerge as key attributes across all segments, explaining their substantial impact on sales figures. Additionally, mainstream customers tend to have a significantly larger representation of young singles/couples enabling us to understand the key demographics in our customer segmentation.

  • I recommend focusing your marketing efforts on retirees and older singles/couples, as they tend to be the most significant spenders such as the use of targeted advertising to showcase location of stores and what they offer on outlets they engage with and also develop special promotions and products tailored to their needs and preferences. Additionally, for mainstream customers, particularly young singles and couples. Create campaigns that resonate with their interests to drive sales, as they make up a substantial portion of your customer base.

Conclusion :¶

In summary, our analysis highlights the importance of targeting key customer segments and focusing on popular pack sizes, such as 175g and 150g, to maximize sales and optimize marketing efforts. By aligning product offerings with customer preferences and effectively managing inventory, we can enhance overall sales performance.